Abstract

The entropy of an ergodic finite-alphabet process can be computed from a single
typical sample path $x_1^n$ using the entropy of the $k$-block empirical probability and
letting $k$ grow with $n$ roughly like $\log n$. We further assume that the distribution
of the process is a $g$-measure. We prove large deviation principles for conditional,
non-conditional and relative $k(n)$-block empirical entropies.

1. Introduction

A problem of interest is the entropy-estimation problem. Given a sample path
$x_1, x_2, \ldots, x_n$ (where the $x_i$'s are drawn from a finite alphabet $A$) typical for an
unknown ergodic source, how can one estimate its entropy? The simplest idea is to use a
"plug-in" estimator. First one computes, for each block of length $k$, the $k$-marginals
of the source as the limit, when $n \to \infty$, of the $k$-block empirical probability of the
sample $x_1^n$; then one can compute the $k$-block entropy of the source and let $k \to \infty$
to get the entropy of the source. A natural question is thus: how is it possible to
choose $k = k(n)$ to do these two steps at the same time? Ornstein and Weiss j2]
(see also |^) proved that this is indeed possible for any ergodic source of positive
entropy if $k$ does not grow 'too fast' with $n$, loosely like $\log n$. The proof is based
on an 'empirical version' of the Shannon-McMillan-Breiman Theorem.

A first result about fluctuations of $k(n)$-block empirical entropies, refining the
Ornstein-Weiss almost-sure result, was obtained in [16]. In that paper the authors consider
chains of infinite order which lose memory exponentially fast. Under additional
restrictions on the sequence $k(n)$ they prove a central limit theorem for the conditional
$k(n)$-block empirical entropy and they also prove that the rescaled $k(n)$-block
empirical entropy cannot have Gaussian fluctuations.

In the present paper, we are interested in large deviations for $k(n)$-block empirical
entropies. To this end we assume that the distribution of the process generating the
sample path $x_1^n$ is a $g$-measure for the potential $\phi = \log g$ (see below for definitions
and references). Such a process can be viewed as (a special case of) a chain with
complete connections or a chain of infinite order, see e.g. [VAX I14j . Another way,
especially useful for our concern, to characterize and describe a $g$-measure is as a
one-dimensional equilibrium state [TKll^ .

In this setting, we prove large deviation principles for conditional, non-conditional
and relative entropies of the $k(n)$-block empirical probability of the sample path
$x_1^n$, when $k(n)$ grows, roughly speaking, like $\log n$. This is done for any $g$-measure.

When the block length $k$ is fixed, it is easy to obtain a large deviation principle
for $k$-block empirical entropies by "contraction" of the large deviation principle for
the empirical process [B] . This is possible because $k$-block entropies are continuous
in the weak topology. To prove the result when $k(n)$ grows with $n$ we will generalize
some classical combinatorial techniques. We will use the combinatorics of types to
see "how fast we can let $k$ grow with $n$", and get a condition close to that of
Ornstein and Weiss.

The rate functions we obtain are convex and we will also compute their Legendre
transforms, which coincide with the corresponding scaled cumulant generating
functions. This will allow us to derive some properties of the rate functions and an explicit
representation in some cases.

Let us notice that the rate function we obtain for the conditional and the rescaled
non-conditional empirical entropy can have a linear part. This unexpected feature is
related to the entropy of the zero-temperature limit of equilibrium states, which can be
nonzero in general.

Let us briefly mention that other techniques and ideas have been developed around
the problem of entropy estimation. The "plug-in" estimator is only one
among several entropy estimators, see e.g. jZ| IHl 123 12^]'. We point out
that we could have worked in the context of one-dimensional Gibbs measures. An
interesting issue is the case of multi-dimensional Gibbs measures since we can no
longer use the combinatorics of types.

The present paper is organized as follows. In the next section we record preliminary
definitions and notions, in particular on $g$-measures and the various entropies
under study. In Section 3 we present our main results. In Section 4 we discuss our
results, in particular the form of the rate functions that we obtain for empirical
entropies. Section 5 is devoted to the collection of combinatorial tools needed to
understand "how fast $k$ can grow with $n$" later on. Section 6 contains the proof of
the main results.



2. Preliminary definitions and notions 

Let $A$ be a finite alphabet. We will denote by $a_1^\infty \stackrel{\rm def}{=} (a_1, a_2, \ldots)$ the elements
of $A^{\mathbb N}$ and by $a_1^k$ the finite string $(a_1, \ldots, a_k)$. We will use the notation $x_1^n$ for a
"sample path" $(x_1, x_2, \ldots, x_n)$, $x_i \in A$. We denote by $T$ the "shift" operator defined
as $T x_1^\infty = x_2^\infty$. The cylinder set $[a_1^n]$ is the set of infinite strings $b_1^\infty$ drawn from
$A^{\mathbb N}$ such that $b_1^n = a_1^n$.

We call $\mathcal M^k$ the set of probability measures $\nu_k$ on $A^k$ and $\mathcal M_s^k$ the set of
probability measures $\nu_k$ on $A^k$ which satisfy the following stationarity condition

$$\sum_{b \in A} \nu_k(a_1^{k-1} b) = \sum_{b \in A} \nu_k(b\, a_1^{k-1}) \quad \forall\, a_1^{k-1} \in A^{k-1}. \tag{2.1}$$

The subset $\mathcal M_s^k$ is convex and $\mathcal E^k$ denotes the set of its extremal elements.

We call $\mathcal M$ the set of probability measures $\nu$ on $A^{\mathbb N}$ with the usual sigma-algebra
of cylinders. The subset of shift-invariant (or stationary) measures is denoted by
$\mathcal M_s$. The set of ergodic measures (the extremal points of $\mathcal M_s$) is denoted by $\mathcal E$.

Given a measure $\nu \in \mathcal M_s$ we will write $\nu_k$ for its $k$-marginals. Of course we have
the identity $\nu_k(a_1^k) = \nu([a_1^k])$ for any $a_1^k \in A^k$ and consequently $\nu_k \in \mathcal M_s^k$.

2.1. $g$-measures and equilibrium states. In this paper we deal with $g$-measures
associated to continuous and regular $g$-functions. We refer the reader to J19ll2()lf^
for full details about the following material.

Let $g$ be a continuous function on $A^{\mathbb N}$ satisfying

$$\sum_{b \in A} g(b\, a_1^\infty) = 1 \quad \text{for all } a_1^\infty \in A^{\mathbb N}. \tag{2.2}$$

We further assume that $g$ is strictly positive (this implies $g < 1$ by (2.2)). We
associate to such a function a potential, normalized according to (2.2), by setting

$$\phi \stackrel{\rm def}{=} \log g. \tag{2.3}$$

Observe that $\phi < 0$. A $g$-measure can be defined as an equilibrium state for the
potential $\phi$. We measure the continuity of $\phi$ by the sequence of its variations
$(\mathrm{var}_m(\phi))_{m \in \mathbb N}$:

$$\mathrm{var}_m(\phi) \stackrel{\rm def}{=} \sup\{ |\phi(a_1^\infty) - \phi(b_1^\infty)| : a_1^m = b_1^m \}. \tag{2.4}$$

Notice that (uniform) continuity of $\phi$ (with respect to the canonical distance metrizing
the product topology) is equivalent to $\mathrm{var}_m(\phi) \to 0$ as $m \to \infty$.

It is well-known that if $\mathrm{var}_m(\phi)$ decreases to 0 fast enough, then there is a
unique $g$-measure which is the unique equilibrium state for $\phi$. This is the case, for
instance, if the decay is exponential or, more generally, summable [26]. On the other hand, an
example of non-uniqueness was given by Bramson and Kalikow HI. In that example,
var,„ {(f)) > . . Very recently the authors of ,2 showed that square-summability of 
variations, ensuring uniqueness [18], is tight. Let us mention a uniqueness criterion
based on a "one-sided" Dobrushin condition involving oscillations of the potential
instead of variations |^| .



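From now on, we fix one of the $g$-measures associated to $\phi$ and denote it by $\rho$.
For all $n \ge 1$ and $a_1^\infty \in A^{\mathbb N}$, we have the following property

$$e^{-n\epsilon_n} \le \frac{\rho([a_1^n])}{\exp\big(\sum_{j=1}^{n} \phi(a_j^\infty)\big)} \le e^{n\epsilon_n} \tag{2.5}$$

where $(\epsilon_n)_n$ is a sequence of non-negative real numbers decreasing to 0.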

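For $k \ge 2$, let $\rho^{(k)}$ be the $(k-1)$-step Markov approximation of $\rho$, that is, the
(unique) equilibrium state of the cylindrical potential

$$\phi_k(a_1^\infty) = \phi_k(a_1^k) \stackrel{\rm def}{=} \log \frac{\rho([a_1^k])}{\rho([a_2^k])}.$$

When $k = 1$, $\rho^{(1)}$ is the Bernoulli measure for the potential $\phi_1(a_1^\infty) = \phi_1(a_1) =
\log \rho([a_1])$. We can see $\phi_k$ also as a function on $A^k$.
We have the following property

$$\|\phi - \phi_k\|_\infty \le \mathrm{var}_k(\phi). \tag{2.6}$$

This implies the statement that for all $a_1^\infty \in A^{\mathbb N}$

$$\lim_{k \to \infty} \log \frac{\rho([a_1^k])}{\rho([a_2^k])} = \phi(a_1^\infty)$$

uniformly.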

We shall use the variational principle repeatedly. Let $\psi : A^{\mathbb N} \to \mathbb R$ be a continuous
function. Then:

$$\sup\{ \mathbb E_\eta[\psi] + h(\eta) : \eta \in \mathcal M_s \} = P_{top}(\psi). \tag{2.7}$$

Moreover, the supremum is attained if and only if $\eta$ is an equilibrium state of $\psi$.
$P_{top}(\psi)$ is the topological pressure of $\psi$. It is defined as

$$P_{top}(\psi) \stackrel{\rm def}{=} \lim_{n \to \infty} \frac{1}{n} \log \sum_{a_1^n} \exp\Big( \sup\Big\{ \sum_{j=1}^{n} \psi(b_j^\infty) : b_1^\infty \in [a_1^n] \Big\} \Big). \tag{2.8}$$

Coming back to a normalized potential $\phi = \log g$, we have $P_{top}(\phi) = 0$. This
can be seen, for instance, by plugging (2.5) in (2.8). The variational principle then
tells us that

$$h(\rho) = -\mathbb E_\rho[\phi]. \tag{2.9}$$

In particular, the entropy of a $g$-measure is always strictly positive.
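The identities $P_{top}(\phi) = 0$ and $h(\rho) = -\mathbb E_\rho[\phi]$ can be checked numerically in the simplest non-trivial case. The sketch below assumes a two-state $g$-function depending on two coordinates only, $g(b\,a_1^\infty) = Q[a_1][b]$ with $Q$ a row-stochastic matrix (the matrix `Q` and all function names are illustrative choices, not objects from the text). For such a potential the pressure of $\beta\phi$ is the logarithm of the largest eigenvalue of the matrix with entries $Q[i][j]^\beta$, which is the standard transfer-matrix formula.

```python
import math


def pressure(q, beta):
    """P_top(beta * phi) for phi = log g with g(b a_1 ...) = q[a_1][b].

    Uses the closed-form largest eigenvalue of the 2x2 matrix (q[i][j] ** beta).
    """
    a, b = q[0][0] ** beta, q[0][1] ** beta
    c, d = q[1][0] ** beta, q[1][1] ** beta
    trace, det = a + d, a * d - b * c
    return math.log(0.5 * (trace + math.sqrt(trace * trace - 4.0 * det)))


def stationary(q):
    """Stationary distribution of a two-state row-stochastic matrix."""
    p01, p10 = q[0][1], q[1][0]
    return [p10 / (p01 + p10), p01 / (p01 + p10)]


def entropy_rate(q):
    """Entropy of the stationary Markov measure, -sum_i pi_i sum_j q_ij log q_ij."""
    pi = stationary(q)
    return -sum(pi[i] * q[i][j] * math.log(q[i][j])
                for i in range(2) for j in range(2))


# Hypothetical example matrix (an assumption chosen for illustration).
Q = [[0.9, 0.1], [0.4, 0.6]]

# Normalized potential: the pressure vanishes at beta = 1.
print("P_top(phi)      =", pressure(Q, 1.0))

# d/dbeta P_top(beta * phi) at beta = 1 equals E_rho[phi], so minus the slope
# should match the entropy h(rho), as in (2.9).
step = 1e-5
slope = (pressure(Q, 1.0 + step) - pressure(Q, 1.0 - step)) / (2.0 * step)
print("-E_rho[phi]     =", -slope)
print("h(rho)          =", entropy_rate(Q))
```

The derivative is taken by a central difference only to avoid hard-coding $\mathbb E_\rho[\phi]$; the two printed values agree up to discretization error.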

We shall also consider multiples of the potential $\phi$, that is potentials of the form
$\beta\phi$, $\beta \in \mathbb R$. When $\beta \neq 1$, such potentials have no reason to be normalized as $\phi$
is, i.e. the corresponding equilibrium states are not $g$-measures. But this does not
matter for us in the sense that we will only deal with equilibrium states of $\beta\phi$ that
we will indicate with $\rho_{\beta\phi}$.



Remark. A $g$-measure is also named a chain of infinite order or a chain with
complete connections, see e.g. ^31, QH for recent accounts. See also ^Zj. In
probabilistic terms, a chain of infinite order, or a chain with complete connections,
is a process characterized by transition probabilities that depend on the whole past
in a continuous manner. A $g$-measure can also be interpreted as a one-dimensional
Gibbs measure if the variations go to 0 exponentially fast JS] .



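2.2. Entropies. The $k$-block ($k \ge 1$) Shannon entropy is defined as

$$H_k(\nu) = H_k(\nu_k) \stackrel{\rm def}{=} -\sum_{a_1^k \in A^k} \nu_k(a_1^k) \log \nu_k(a_1^k).$$

The conditional $k$-block ($k \ge 2$) entropy is defined as

$$h_k(\nu) = h_k(\nu_k) \stackrel{\rm def}{=} -\sum_{a_1^k \in A^k} \nu_k(a_1^k) \log \nu_k(a_k | a_1^{k-1})$$

where $\nu_k(a_k | a_1^{k-1})$ is the conditional probability $\nu_k(a_1^k) / \sum_b \nu_k(a_1^{k-1} b)$. We have
the relation

$$h_k(\nu) = H_k(\nu) - H_{k-1}(\nu)$$

where by convention we set $H_0(\nu) = 0$. Hence $h_1(\nu) = H_1(\nu)$. Note that $h_k(\cdot)$ is
a concave function on $\mathcal M_s^k$.

It is well-known that if $\nu$ is a stationary measure, then

$$\lim_{k \to \infty} h_k(\nu) = \lim_{k \to \infty} \frac{H_k(\nu)}{k} = h(\nu)$$

where $h(\nu)$ is the (Shannon-Kolmogorov-Sinai) entropy of $\nu$.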
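A small numerical illustration of the two limits above, under the assumption that $\nu$ is a stationary two-state Markov chain (the transition matrix `Q`, its stationary vector `PI` and the function names are hypothetical choices made for this sketch). For a Markov measure the conditional entropies $h_k$ already equal $h(\nu)$ for every $k \ge 2$, while $H_k/k$ approaches $h(\nu)$ only at rate $1/k$.

```python
import itertools
import math


def block_probability(word, pi, q):
    """nu_k(a_1^k) for a stationary Markov chain with initial law pi and matrix q."""
    prob = pi[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= q[a][b]
    return prob


def block_entropy(k, pi, q):
    """H_k(nu) = -sum over k-blocks of nu_k log nu_k, with H_0 = 0 by convention."""
    if k == 0:
        return 0.0
    total = 0.0
    for word in itertools.product(range(len(pi)), repeat=k):
        p = block_probability(word, pi, q)
        if p > 0.0:
            total -= p * math.log(p)
    return total


def conditional_block_entropy(k, pi, q):
    """h_k(nu) = H_k(nu) - H_{k-1}(nu)."""
    return block_entropy(k, pi, q) - block_entropy(k - 1, pi, q)


# Hypothetical chain; PI is the stationary distribution of Q.
Q = [[0.9, 0.1], [0.4, 0.6]]
PI = [0.8, 0.2]
RATE = -sum(PI[i] * Q[i][j] * math.log(Q[i][j]) for i in range(2) for j in range(2))

for k in range(1, 7):
    print(k, conditional_block_entropy(k, PI, Q), block_entropy(k, PI, Q) / k)
print("h(nu) =", RATE)
```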

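The $k$-block ($k \ge 1$) relative entropy of a stationary measure $\nu$ with respect to
a $g$-measure $\rho$ is defined as

$$D_k(\nu|\rho) = D_k(\nu_k|\rho_k) \stackrel{\rm def}{=} \sum_{a_1^k \in A^k} \nu([a_1^k]) \log \frac{\nu([a_1^k])}{\rho([a_1^k])} = \sum_{a_1^k \in A^k} \nu_k(a_1^k) \log \frac{\nu_k(a_1^k)}{\rho_k(a_1^k)}.$$

The map $D_k(\cdot|\rho_k)$ is convex on $\mathcal M^k$. The conditional $k$-block ($k \ge 1$) relative
entropy is defined as

$$\Delta_k(\nu|\rho) \stackrel{\rm def}{=} D_k(\nu|\rho) - D_{k-1}(\nu|\rho) = \Delta_k(\nu_k|\rho_k) \stackrel{\rm def}{=} D_k(\nu_k|\rho_k) - D_{k-1}(\nu_{k-1}|\rho_{k-1})$$

where we set $D_0(\nu|\rho) = 0$. This imposes $\Delta_1(\nu|\rho) = D_1(\nu|\rho)$.

The relative entropy $h(\nu|\rho)$ between $\nu \in \mathcal M_s$ and a $g$-measure $\rho$ is defined as

$$h(\nu|\rho) \stackrel{\rm def}{=} \lim_{k \to \infty} \frac{1}{k} D_k(\nu|\rho) = \lim_{k \to \infty} \Delta_k(\nu|\rho) \quad \text{and} \quad h(\nu|\rho) = -\mathbb E_\nu[\phi] - h(\nu). \tag{2.10}$$

By the variational principle, it is obvious that $h(\nu|\rho) = 0$ if, and only if, $\nu$ is an
equilibrium state of $\phi$. (See [Sj for more details.)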
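As a sanity check of (2.10), the following worked case treats two Bernoulli (product) measures. The one-symbol marginals $p$ and $q$ are illustrative assumptions, not objects from the text; $p$ is taken strictly positive so that $\rho$ is a $g$-measure with $g(a_1^\infty) = p(a_1)$ and $\phi(a_1^\infty) = \log p(a_1)$.

```latex
% Assumption: \rho and \nu are product measures with one-symbol marginals
% p and q respectively, and p(a) > 0 for every a in A.
\begin{align*}
D_k(\nu|\rho)
  &= \sum_{a_1^k \in A^k} \prod_{i=1}^{k} q(a_i) \, \sum_{i=1}^{k} \log\frac{q(a_i)}{p(a_i)}
   = k \sum_{a \in A} q(a) \log\frac{q(a)}{p(a)}, \\
\Delta_k(\nu|\rho)
  &= D_k(\nu|\rho) - D_{k-1}(\nu|\rho)
   = \sum_{a \in A} q(a) \log\frac{q(a)}{p(a)}, \\
h(\nu|\rho)
  &= \lim_{k\to\infty} \tfrac{1}{k} D_k(\nu|\rho)
   = \lim_{k\to\infty} \Delta_k(\nu|\rho)
   = \sum_{a \in A} q(a) \log\frac{q(a)}{p(a)}, \\
-\mathbb{E}_\nu[\phi] - h(\nu)
  &= -\sum_{a \in A} q(a) \log p(a) + \sum_{a \in A} q(a) \log q(a)
   = h(\nu|\rho).
\end{align*}
```

In this case both limits in (2.10) are attained at every $k$, and $h(\nu|\rho) = 0$ exactly when $q = p$.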

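2.3. Empirical measures and entropies. Given a finite string (a "sample path")
$x_1^n$ we define the empirical measures

$$\pi_k(a_1^k; x_1^n) = \pi_{k,n}(a_1^k) \stackrel{\rm def}{=} \frac{1}{n} \sum_{j=1}^{n} \mathbf 1\{ \tilde x_j^{j+k-1} = a_1^k \}, \quad k \in \mathbb N$$

where $\tilde x_1^\infty \in A^{\mathbb N}$ is the periodic, with period $n$, sample path $(x_1^n x_1^n x_1^n \cdots)$.

It is easy to see that $\pi_k(\cdot; x_1^n) \in \mathcal M_s^k$. The family of probability measures
$(\pi_k(\cdot; x_1^n))_{k \in \mathbb N}$ is consistent in the sense that

$$\sum_{a_j \in A} \pi_j(a_1^j; x_1^n) = \pi_{j-1}(a_1^{j-1}; x_1^n), \quad j \in \mathbb N$$

and its members are the marginals of the empirical process $\pi(\cdot; x_1^n)$ defined as

$$\pi(S; x_1^n) \stackrel{\rm def}{=} \frac{1}{n} \sum_{j=0}^{n-1} \delta_{T^j \tilde x_1^\infty}(S) \tag{2.11}$$

where $S$ is any measurable subset of $A^{\mathbb N}$.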

We can now define the following plug-in estimators for entropies. 



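Definition 2.1. Let $x_1^n \in A^n$ be a sample path. The $k$-block empirical entropy is
defined as

$$H_k(x_1^n) \stackrel{\rm def}{=} H_k(\pi_k(\cdot; x_1^n)).$$

The conditional $k$-block empirical entropy is defined as

$$h_k(x_1^n) \stackrel{\rm def}{=} h_k(\pi_k(\cdot; x_1^n)).$$

The relative $k$-block empirical entropy with respect to a measure $\rho$ is defined as

$$D_k(x_1^n|\rho) \stackrel{\rm def}{=} D_k(\pi_k(\cdot; x_1^n)|\rho_k).$$

The relative conditional $k$-block empirical entropy with respect to a measure $\rho$ is
defined as

$$\Delta_k(x_1^n|\rho) \stackrel{\rm def}{=} \Delta_k(\pi_k(\cdot; x_1^n)|\rho_k).$$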
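The plug-in estimators of this definition are straightforward to compute. The following is a minimal sketch, assuming the sample path is given as a Python string or sequence and that blocks are read cyclically, as with the periodic extension used for the empirical measures; the function names are illustrative.

```python
import math
from collections import Counter


def empirical_measure(sample, k):
    """Cyclic k-block empirical measure: dict mapping each k-block to its frequency."""
    n = len(sample)
    counts = Counter(
        tuple(sample[(j + i) % n] for i in range(k)) for j in range(n)
    )
    return {block: count / n for block, count in counts.items()}


def block_entropy(sample, k):
    """k-block empirical entropy H_k(x_1^n); for k = 0 the value is 0."""
    return -sum(p * math.log(p) for p in empirical_measure(sample, k).values())


def conditional_block_entropy(sample, k):
    """Conditional k-block empirical entropy h_k(x_1^n) = H_k - H_{k-1}."""
    return block_entropy(sample, k) - block_entropy(sample, k - 1)


# Hypothetical sample path over the alphabet {"0", "1"}.
sample = "0110100110010110"
for k in (1, 2, 3):
    print(k, block_entropy(sample, k) / k, conditional_block_entropy(sample, k))
```

Reading blocks cyclically is what makes every $\pi_k(\cdot; x_1^n)$ stationary in the sense of (2.1), so both marginals of the 2-block measure reproduce the 1-block measure exactly.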

3. Main results 

We are now ready to state the main results of this paper. 

Theorem 3.1 (Large deviation principles for empirical entropies). Let $x_1^n$ be a
sample path distributed according to a $g$-measure $\rho$. Assume that $(k(n))_{n \in \mathbb N}$ diverges
and eventually satisfies

$$k(n) \le \frac{1-\epsilon}{\log|A|} \log n \tag{3.1}$$

for some $0 < \epsilon < 1$. Then the conditional empirical entropy $h_{k(n)}(x_1^n)$ satisfies the
following large deviation principle:
For any closed set $C \subset \mathbb R$

$$\limsup_{n \to \infty} \frac{1}{n} \log \rho\big\{ x_1^n : h_{k(n)}(x_1^n) \in C \big\} \le -\inf\{ I(u) : u \in C \}.$$

For any open set $O \subset \mathbb R$

$$\liminf_{n \to \infty} \frac{1}{n} \log \rho\big\{ x_1^n : h_{k(n)}(x_1^n) \in O \big\} \ge -\inf\{ I(u) : u \in O \}$$

where the convex rate function $I$ is defined as

$$I(u) = \begin{cases} \inf\{ h(\nu|\rho) : \nu \in \mathcal M_s,\ h(\nu) = u \} & u \in [0, \log|A|] \\ +\infty & \text{otherwise.} \end{cases}$$

The same large deviation principle holds if we replace $h_{k(n)}(x_1^n)$ by the rescaled
empirical entropy $H_{k(n)}(x_1^n)/k(n)$.



Theorem 3.2 (Large deviations for empirical relative entropies). Let $x_1^n$ be a sample
path distributed according to a $g$-measure $\rho$. Suppose that $(k(n))_{n \in \mathbb N}$ diverges
and eventually satisfies $k(n) \le \frac{1-\epsilon}{\log|A|} \log n$, for some $0 < \epsilon < 1$. Then the
empirical relative entropies $\Delta_{k(n)}(x_1^n|\rho)$ and $\frac{1}{k(n)} D_{k(n)}(x_1^n|\rho)$ satisfy a large deviation
principle as in Theorem 3.1 but with the rate function

$$J(u) = \begin{cases} u & u \in [0, -\inf\{ \mathbb E_\eta[\phi] : \eta \in \mathcal E \}] \\ +\infty & \text{otherwise.} \end{cases} \tag{3.3}$$

These theorems are proved in Section 6. Their proof relies in an essential way
upon combinatorial properties of types and a continuity property of entropy which
are established in Section 5.

The following proposition deals with the case of fixed block length. The preceding
theorems extend this proposition to the case when $k(n)$ is allowed to grow with $n$
according to (3.1).


Proposition 3.3 (Large deviations for fixed block length). Let $x_1^n$ be a sample
path distributed according to a $g$-measure $\rho$. Then, for each $k \ge 1$, the empirical
entropies $\frac{1}{k} H_k(x_1^n)$, $h_k(x_1^n)$, $\frac{1}{k} D_k(x_1^n|\rho)$ and $\Delta_k(x_1^n|\rho)$ satisfy a LDP with normalizing
factor $\frac{1}{n}$ and rate functions respectively given by

$$I_k^{H}(u) = \inf\{ h(\nu|\rho) : H_k(\nu)/k = u \}, \qquad I_k^{h}(u) = \inf\{ h(\nu|\rho) : h_k(\nu) = u \},$$

$$I_k^{D}(u) = \inf\{ h(\nu|\rho) : D_k(\nu_k|\rho_k)/k = u \}, \qquad I_k^{\Delta}(u) = \inf\{ h(\nu|\rho) : \Delta_k(\nu_k|\rho_k) = u \}$$

where the infima are taken over $\nu \in \mathcal M_s$. The infimum over an empty set is taken
equal to $+\infty$ following the usual convention.

This proposition is a direct consequence of the contraction principle and suggests
that the rate functions we can expect when we consider $k(n)$ growing with $n$ are
"contracted" relative entropies. Note that the rate functions of Proposition 3.3
need not be convex.

From the convexity of I and J we know that they are in Legendre duality with 
the corresponding scaled cumulant generating function for the different empirical 
entropies. In the next two propositions we give the expression of the scaled cumulant 
generating function for empirical entropies and empirical relative entropies. 

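Proposition 3.4. Assume that the hypotheses of Theorem 3.1 hold. Then the rate
function $I$ is in Legendre duality with the convex function $t \mapsto R(t)$, $t \in \mathbb R$, defined
as

$$R(t) \stackrel{\rm def}{=} \begin{cases} (t+1)\, P_{top}\big(\phi/(t+1)\big) & \text{for } t > -1 \\ \sup\{ \mathbb E_\eta[\phi] : \eta \in \mathcal E \} & \text{for } t \le -1. \end{cases} \tag{3.4}$$

Moreover,

$$\lim_{n \to \infty} \frac{1}{n} \log \mathbb E_\rho\Big[ e^{n t\, h_{k(n)}(x_1^n)} \Big] = \lim_{n \to \infty} \frac{1}{n} \log \mathbb E_\rho\Big[ e^{n t\, H_{k(n)}(x_1^n)/k(n)} \Big] = R(t). \tag{3.5}$$

Using (2.5) it is easy to check that

$$R(t) = (t+1) \lim_{n \to \infty} \frac{1}{n} \log \sum_{a_1^n} \rho([a_1^n])^{\frac{1}{t+1}} \quad \text{for } t > -1. \tag{3.6}$$

This resembles a Rényi entropy.
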

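Proposition 3.5. Assume that the hypotheses of Theorem 3.2 hold. Then the rate
function $J$ is in Legendre duality with the convex function $t \mapsto P^{\Delta}(t)$, $t \in \mathbb R$, defined
as

$$P^{\Delta}(t) \stackrel{\rm def}{=} \begin{cases} (1-t) \inf\{ \mathbb E_\eta[\phi] : \eta \in \mathcal E \} & t > 1 \\ 0 & t \le 1. \end{cases}$$

Moreover,

$$\lim_{n \to \infty} \frac{1}{n} \log \mathbb E_\rho\Big[ e^{n t\, \Delta_{k(n)}(x_1^n|\rho)} \Big] = \lim_{n \to \infty} \frac{1}{n} \log \mathbb E_\rho\Big[ e^{n t\, D_{k(n)}(x_1^n|\rho)/k(n)} \Big] = P^{\Delta}(t). \tag{3.7}$$
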

Let us introduce

$$h_\infty \stackrel{\rm def}{=} \lim_{\beta \to +\infty} h(\rho_{\beta\phi}). \tag{3.8}$$

The existence of this limit will be shown below (Lemma 6.1). In general $h_\infty$ can be
strictly positive and we stress that it is equal to $\log|A|$ for the uniform Bernoulli
measure.

In the case when $R$ is a strictly convex, continuously differentiable function on
$]-1, +\infty[$, we can improve the results of Theorem 3.1. A large class of $g$-measures
satisfies this property, namely those associated to potentials with square summable
variations.



Proposition 3.6 (More on large deviations). In addition to the assumptions of
Theorem 3.1, assume that the variations of $\phi$ are square summable. Then $I$ is strictly
convex on $[h_\infty, \log|A|]$, with a unique minimum, where it assumes the value 0, at
$u = h(\rho)$. Moreover it admits the following representation:

$$I(u) = h(\rho_{\beta_u \phi}|\rho) \quad \text{for } u \in [h_\infty, \log|A|] \tag{3.9}$$

where $\beta_u \ge 0$ is the unique solution of the equation $h(\rho_{\beta\phi}) = u$. On the interval
$[0, h_\infty]$ the function $I$ is linear:

$$I(u) = -u - \sup\{ \mathbb E_\eta[\phi] : \eta \in \mathcal E \}.$$
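The linear part of $I$ can be seen by hand in the simplest degenerate case. The computation below is a sketch for the uniform Bernoulli measure on $A$, for which $\phi \equiv -\log|A|$; it only uses the formula for $I$ in Theorem 3.1, identity (2.10) and the definition of $R$ in Proposition 3.4.

```latex
% Assumption: \rho is the uniform Bernoulli measure, so \phi \equiv -\log|A|.
\begin{align*}
h(\nu|\rho) &= -\mathbb{E}_\nu[\phi] - h(\nu) = \log|A| - h(\nu)
  && \text{for every } \nu \in \mathcal{M}_s, \\
I(u) &= \inf\{ h(\nu|\rho) : h(\nu) = u \} = \log|A| - u
  && \text{for } u \in [0, \log|A|], \\
P_{top}(\beta\phi) &= \log|A| - \beta \log|A| = (1-\beta)\log|A|, \\
R(t) &= (t+1)\Big(1 - \tfrac{1}{t+1}\Big)\log|A| = t \log|A|
  && \text{for } t > -1, \\
R(t) &= \sup\{ \mathbb{E}_\eta[\phi] : \eta \in \mathcal{E} \} = -\log|A|
  && \text{for } t \le -1.
\end{align*}
```

Here every $\beta\phi$ is constant, so $\rho_{\beta\phi} = \rho$ and $h_\infty = \log|A| = h(\rho)$: the interval $[h_\infty, \log|A|]$ reduces to a point, and $I(u) = -u - \sup\{\mathbb E_\eta[\phi]\} = \log|A| - u$ is linear on all of $[0, \log|A|]$, vanishing only at $u = h(\rho)$. One also checks directly that $\sup_t \{ tu - R(t) \} = \log|A| - u$, the supremum being reached at $t = -1$.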



4. Comments on the results 
We make some comments on the above results. 

Zero-temperature limit and non-differentiability of $R$ at $-1$. By using a
classical formula for the derivative of the pressure JHI , it is straightforward to see
that the right derivative of $t \mapsto R(t)$ at $-1$, when the variations of $\phi$ are square
summable, is equal to

$$\lim_{\beta \to +\infty} \big( P_{top}(\beta\phi) - \beta\, \mathbb E_{\rho_{\beta\phi}}[\phi] \big)$$

where we recall that $\rho_{\beta\phi}$ is the equilibrium state of the potential $\beta\phi$. By the
variational principle, we thus get that

$$\lim_{t \downarrow -1} \frac{dR}{dt}(t) = \lim_{\beta \to +\infty} h(\rho_{\beta\phi}) = h_\infty.$$

This limit is not zero in general, whereas $R$ is constant on $]-\infty, -1]$; therefore the
function $R$ is not differentiable at $t = -1$. Notice that this is related to the
zero-temperature limit of equilibrium states.

About the route to large deviations. Let us emphasize that we prove our
large deviation bounds directly. Another way to prove large deviation principles is
to first prove the existence of the corresponding scaled cumulant generating
function, and then to apply the Gärtner-Ellis Theorem (see e.g. JJ). To that end one
needs to prove, e.g., that the scaled cumulant generating function is differentiable
and strictly convex. We could do that under the assumption that the potential
of the $g$-measure has square-summable variations. But, as (3.4) shows, the scaled
cumulant generating function is not differentiable at $-1$ in the case $h_\infty \neq 0$.
Therefore one cannot apply the Gärtner-Ellis theorem. Notice also that the rate functions
of Proposition 3.3 can be non-convex in general. This means that even in the case
when $k$ is fixed the Gärtner-Ellis theorem may not apply.

We want to stress that with our approach we need not assume anything on
the rate of convergence to zero of the variations of the potential.

On the growth condition (3.1). A look at the proof of Theorem 3.1 reveals
that we actually have a slightly more general condition on $k(n)$. In fact we could
impose, e.g.,

$$\frac{(\log n)^2\, |A|^{k(n)}}{n} \to 0 \quad \text{as } n \to \infty.$$

We feel that condition (3.1) is more appealing and it is related to the condition
which appears in the laws of large numbers for empirical entropies (see below).
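To get a feeling for how slowly the block length may grow, the sketch below assumes that the growth condition has the form $k(n) \le \frac{1-\epsilon}{\log|A|} \log n$ and evaluates, for a binary alphabet, the per-symbol cost $\frac{1}{n} \log (n+1)^{|A|^{k}}$ of the type-counting bound used in Section 5. The function names and the choice $\epsilon = 1/2$ are illustrative assumptions.

```python
import math


def max_block_length(n, alphabet_size, eps):
    """Largest integer k with k <= (1 - eps) * log(n) / log(alphabet_size)."""
    return math.floor((1.0 - eps) * math.log(n) / math.log(alphabet_size))


def type_count_exponent(n, alphabet_size, k):
    """(1/n) * log((n + 1) ** (alphabet_size ** k)): cost of counting k-types."""
    return alphabet_size ** k * math.log(n + 1) / n


for n in (10**3, 10**4, 10**5, 10**6):
    k = max_block_length(n, 2, 0.5)
    print(n, k, type_count_exponent(n, 2, k))
```

Under this reading $|A|^{k(n)} \le n^{1-\epsilon}$, so the printed exponent is at most $n^{-\epsilon} \log(n+1)$ and tends to zero, which is what makes the polynomial prefactors in the proofs irrelevant at the exponential scale.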

Flatness of $I$. If $\rho$ is not the unique equilibrium state of $\phi$, it is easy to see
that the rate function $I$ can be identically zero in some interval containing $h(\rho)$.
Indeed, the set of equilibrium states of $\phi$ forms a Choquet simplex and the map
$\nu \mapsto h(\nu)$ is affine [21] on the set of shift-invariant measures. Hence, there is
an equilibrium state $\rho_1$ (maybe equal to $\rho$) such that $h(\rho_1)$ minimizes the entropy
among all equilibrium states of $\phi$. It may not be unique but this does not matter:
we call $h_1$ this minimal entropy. We do the same for the maximal entropy and call
$h_2$ the corresponding value (maybe equal to $h(\rho)$). Then, it is easy to verify that
$I(u) = 0$ for all $u \in [h_1, h_2]$ since $I(h_1) = I(h_2) = 0$ (by the variational principle)
and $I$ is convex and non-negative.

Strong laws of large numbers for empirical entropies. If $\rho$ is the unique
equilibrium state of $\phi$ (e.g. when $\phi$ has square-summable variations), then 0 is the
minimum of $I$ and it is attained only at $u = h(\rho)$ (this is an immediate consequence
of the variational principle). We can use Theorem 3.1 and apply the Borel-Cantelli
Lemma to obtain that

$$\lim_{n \to \infty} \frac{1}{k(n)} H_{k(n)}(x_1^n) = \lim_{n \to \infty} h_{k(n)}(x_1^n) = h(\rho) \quad \rho\text{-a.s.}$$

Therefore, we recover in our context the Ornstein-Weiss almost-sure result cited in
the introduction, with a $k(n)$ allowed to grow a little bit less fast and stronger
hypotheses on the source $\rho$. A similar statement, in probability, can be deduced from
the results of |lf>j . The almost-sure convergence of conditional empirical entropy
in the case of an ergodic measure $\nu$ with positive entropy can be proved under the
condition that $k(n) \le \frac{1-\epsilon}{\log|A|} \log n$ (and $k(n) \to \infty$), for some $0 < \epsilon < 1$. If $\epsilon = 0$,
this almost-sure convergence fails in general [24].

The same argument applied to the statement of Theorem 3.2 leads to the almost-sure
convergence of empirical relative entropies to zero

$$\lim_{n \to \infty} \Delta_{k(n)}(x_1^n|\rho) = \lim_{n \to \infty} \frac{1}{k(n)} D_{k(n)}(x_1^n|\rho) = 0 \quad \rho\text{-a.s.}$$

A similar result in probability for $\frac{1}{k(n)} D_{k(n)}(x_1^n|\rho)$ appears in jl()| with more
assumptions on $k(n)$.

Connection with central limit asymptotics. Theorem 3.2 has its own interest,
but it is also connected with the central limit asymptotics of conditional
empirical entropy JSj as follows. The following decomposition holds (see [TB]'):

$$h_{k(n)}(x_1^n) - h(\rho) = -\frac{1}{n} \sum_{j=0}^{n-1} \big( \phi(T^j x_1^\infty) - \mathbb E_\rho[\phi] \big) - \Delta_{k(n)}(x_1^n|\rho) + C_n \tag{4.1}$$

where the correction term $C_n$ is such that $|C_n| \le C\, \mathrm{var}_{k(n)}(\phi)$ and $x_1^\infty \in [x_1^n]$. In
words, the conditional empirical entropy is equal to the empirical average of the
potential $-\phi$, minus the conditional empirical relative entropy between
the empirical measure and the "true" measure, plus a correction.

In [16], the authors assume that the variations of $\phi$ decrease exponentially fast.
They show, under appropriate assumptions on the way $k(n)$ is allowed to grow, that
$\sqrt{n}\, \Delta_{k(n)}(x_1^n|\rho)$ goes to zero in $\rho$-probability, as well as $\sqrt{n}\, C_n$. Therefore, they can
conclude that the central limit theorem for $h_{k(n)}(x_1^n) - h(\rho)$ is equivalent to the
central limit theorem for $-\frac{1}{n} \sum_{j=0}^{n-1} \phi(T^j x_1^\infty) - \mathbb E_\rho[-\phi]$. In particular, the variance
is given by

$$\sigma^2 \stackrel{\rm def}{=} \lim_{n \to \infty} \frac{1}{n}\, \mathbb E_\rho\Big[ \Big( \sum_{j=0}^{n-1} \phi(T^j x_1^\infty) - n\, \mathbb E_\rho[\phi] \Big)^2 \Big]. \tag{4.2}$$

At large deviation scale it is possible to see that the term $C_n$ is irrelevant, but not
$\Delta_{k(n)}(x_1^n|\rho)$.



In fact large deviations for $h_{k(n)}(x_1^n)$ are different from large deviations for
$-\frac{1}{n} \sum_{j=0}^{n-1} \phi(T^j x_1^\infty)$. The latter have the same large deviations as $-\frac{1}{n} \log \rho([x_1^n])$.
Indeed, it is easy to check (using (2.5) and (2.8)) that for any real $t$

$$\Phi(t) \stackrel{\rm def}{=} \lim_{n \to \infty} \frac{1}{n} \log \mathbb E_\rho\Big[ e^{-t \sum_{j=0}^{n-1} \phi \circ T^j} \Big] = \lim_{n \to \infty} \frac{1}{n} \log \sum_{a_1^n} \rho([a_1^n])^{1-t} = P_{top}((1-t)\phi).$$

The common rate function for $(-\frac{1}{n} \log \rho([x_1^n]))_n$ and $(-\frac{1}{n} \sum_{j=0}^{n-1} \phi(T^j x_1^\infty))_n$ is then
given by the Legendre transform of $\Phi$.

In 0, it is proved that $\sigma^2 = \frac{d^2\Phi}{dt^2}(0) = \frac{d^2}{dt^2} P_{top}((1-t)\phi)\big|_{t=0}$. On the other hand, one
expects that the second derivative of the scaled cumulant generating function at 0
(or, equivalently, the inverse of the second derivative at $h(\rho)$ of the rate function)
equals the variance (see footnote 1). Though $R(t) \neq \Phi(t)$ for all $t \neq 0$, a simple computation
shows that $\frac{d^2 R}{dt^2}(0) = \frac{d^2\Phi}{dt^2}(0) = \sigma^2$.

Therefore, we have distinct rate functions (because the conditional empirical
relative entropy "correction" contributes at large deviation scale) but their second
derivatives at $h(\rho)$ coincide.

Remark. Using (4.1), the fact that $\Delta_{k(n)}(x_1^n|\rho) \ge 0$, and the fact that $C_n$ is
irrelevant at large deviation scale, it is easy to get that

$$R(t) \le \Phi(t) \quad \forall t \ge 0, \qquad R(t) \ge \Phi(t) \quad \forall t \le 0.$$
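The comparison between the two scaled cumulant generating functions can be observed numerically. The sketch below assumes a two-state $g$-function depending on two coordinates, given by a hypothetical row-stochastic matrix `Q`, so that $P_{top}(\beta\phi)$ is the logarithm of the largest eigenvalue of the matrix with entries $Q[i][j]^\beta$ (the standard transfer-matrix formula). It evaluates $R(t) = (t+1) P_{top}(\phi/(t+1))$ for $t > -1$ and $\Phi(t) = P_{top}((1-t)\phi)$, and compares their second derivatives at 0 by finite differences.

```python
import math


def pressure(q, beta):
    """P_top(beta * phi): log of the largest eigenvalue of the 2x2 matrix (q[i][j] ** beta)."""
    a, b = q[0][0] ** beta, q[0][1] ** beta
    c, d = q[1][0] ** beta, q[1][1] ** beta
    trace, det = a + d, a * d - b * c
    return math.log(0.5 * (trace + math.sqrt(trace * trace - 4.0 * det)))


def cgf_entropy(q, t):
    """R(t) = (t + 1) * P_top(phi / (t + 1)), valid for t > -1 only."""
    return (t + 1.0) * pressure(q, 1.0 / (t + 1.0))


def cgf_potential(q, t):
    """Phi(t) = P_top((1 - t) * phi)."""
    return pressure(q, 1.0 - t)


def second_derivative_at_zero(f, step=1e-3):
    """Central second difference of f at 0."""
    return (f(step) - 2.0 * f(0.0) + f(-step)) / step ** 2


# Hypothetical example matrix (an assumption chosen for illustration).
Q = [[0.9, 0.1], [0.4, 0.6]]

for t in (-0.9, -0.5, -0.1, 0.0, 0.5, 1.0, 2.0):
    print(f"t={t:+.1f}  R={cgf_entropy(Q, t):+.6f}  Phi={cgf_potential(Q, t):+.6f}")

print("R''(0)   ~", second_derivative_at_zero(lambda t: cgf_entropy(Q, t)))
print("Phi''(0) ~", second_derivative_at_zero(lambda t: cgf_potential(Q, t)))
```

The two curves touch at $t = 0$ with the same curvature, and $R$ lies below $\Phi$ on the right and above it on the left, in agreement with the remark.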
5. Some combinatorial tools 

In this section we collect some definitions and lemmas about types, as well as 
a continuity lemma for conditional entropy. These are essential ingredients for the 
proofs of our main results which are in the next section. The proof of the following 
lemmas are given in Section [3 

We call $\mathcal U^k(A^n)$ the subset of $\mathcal M_s^k$ whose elements can be obtained as empirical
measures of sample paths of length $n$. Formally we set

$$\mathcal U^k(A^n) \stackrel{\rm def}{=} \{ \nu_k \in \mathcal M_s^k : \exists\, x_1^n \in A^n \text{ s.t. } \nu_k(\cdot) = \pi_k(\cdot; x_1^n) \}. \tag{5.1}$$

The set $A^n$ of sample paths $x_1^n$ can be partitioned into equivalence classes called
types. The equivalence relation $\sim_k$ is defined as

$$x_1^n \sim_k y_1^n \iff \pi_k(\cdot; x_1^n) = \pi_k(\cdot; y_1^n). \tag{5.2}$$

Let us call $\mathcal T^k(A^n) = A^n / \sim_k$ the quotient space. Elements of $\mathcal T^k(A^n)$ are labeled
with the corresponding empirical measure $\pi_k(\cdot; x_1^n)$; this means that there is a
bijective correspondence between $\mathcal T^k(A^n)$ and $\mathcal U^k(A^n)$. We call $\mathcal T_{\pi_{k,n}} \in \mathcal T^k(A^n)$ the
type corresponding to $\pi_k(\cdot; x_1^n)$.

We recall that $\mathcal E^k$ denotes the set of extremal elements of $\mathcal M_s^k$.

Lemma 5.1. Given a measure $\nu_k \in \mathcal E^k$ then $h_k(\nu_k) = 0$.

Lemma 5.2. Given a measure $\nu_k \in \mathcal M_s^k$ there exists a measure $\mu_k \in \mathcal U^k(A^n)$ such
that

$$\|\mu_k - \nu_k\|_1 = \sum_{a_1^k \in A^k} |\mu_k(a_1^k) - \nu_k(a_1^k)| \le \frac{(k+2)\, |A|^k}{n}. \tag{5.3}$$

Lemma 5.3. The following inequalities hold

$$|\mathcal T^k(A^n)| = |\mathcal U^k(A^n)| \le (n+1)^{|A|^k} \tag{5.4}$$

$$|\{ x_1^n \in \mathcal T_{\pi_{k,n}} \}| \le (n-1)\, e^{n h_k(\pi_{k,n})} \tag{5.5}$$

$$|\{ x_1^n \in \mathcal T_{\pi_{k,n}} \}| \ge (en)^{-2|A|^k}\, e^{n h_k(\pi_{k,n})} \tag{5.6}$$

Footnote 1. Notice that this does not imply a central limit theorem even under real analyticity, see [5].


Lemma 5.4. We have the following continuity property of the conditional $k$-block
entropy:

$$\sup_{\|\nu_k - \mu_k\|_1 \le \delta} |h_k(\nu_k) - h_k(\mu_k)| \le -2\delta \log \frac{\delta}{|A|^k} \tag{5.7}$$

provided that $\delta \le e^{-1}$.
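The counting bound (5.4) is easy to observe by brute force on small examples. The sketch below enumerates all binary strings of length 8, groups them by their cyclic 2-block counts (that is, by type), and compares the number of types with $(n+1)^{|A|^k}$. The parameters and function names are illustrative; the bound itself holds simply because each of the $|A|^k$ block counts takes a value in $\{0, \ldots, n\}$.

```python
import itertools
from collections import Counter


def cyclic_block_counts(sample, k):
    """Counts of cyclic k-blocks of the sample, as a hashable label of its type."""
    n = len(sample)
    counts = Counter(
        tuple(sample[(j + i) % n] for i in range(k)) for j in range(n)
    )
    return frozenset(counts.items())


def type_classes(alphabet, n, k):
    """Map each k-type (labelled by its block counts) to the size of its class."""
    classes = Counter()
    for sample in itertools.product(alphabet, repeat=n):
        classes[cyclic_block_counts(sample, k)] += 1
    return classes


ALPHABET = (0, 1)
N, K = 8, 2
classes = type_classes(ALPHABET, N, K)

print("number of types :", len(classes))
print("bound (5.4)     :", (N + 1) ** (len(ALPHABET) ** K))
print("largest class   :", max(classes.values()))
```

The number of types grows only polynomially in $n$ for fixed $k$, while the classes themselves are exponentially large; this is what allows sums over sample paths to be replaced by suprema over types at the exponential scale.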

6. Proofs of main results 
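6.1. Proof of Theorem 3.1. Consider a closed set $C \subset \mathbb R$. We have

$$\rho\{ x_1^n : h_k(x_1^n) \in C \} = \sum_{\{x_1^n :\, h_k(x_1^n) \in C\}} \rho([x_1^n]).$$

From (2.5) and (2.6) we get

$$\rho([x_1^n]) = e^{n\, \mathbb E_{\pi_{k,n}}[\phi_k]}\; \xi_{k,n}(x_1^n) \tag{6.1}$$

where

$$e^{-n(\epsilon_n + \mathrm{var}_k(\phi))} \le \xi_{k,n}(x_1^n) \le e^{n(\epsilon_n + \mathrm{var}_k(\phi))}. \tag{6.2}$$

Hence we have

$$\sum_{\{x_1^n :\, h_k(x_1^n) \in C\}} \rho([x_1^n]) \le e^{n(\epsilon_n + \mathrm{var}_k(\phi))} \sum_{\{\pi_{k,n} \in \mathcal U^k(A^n) :\, h_k(\pi_{k,n}) \in C\}} |\{ x_1^n \in \mathcal T_{\pi_{k,n}} \}|\; e^{n\, \mathbb E_{\pi_{k,n}}[\phi_k]}$$

where we have used types defined in Section 5. Let us call

$$h_k^{-1}(C) \stackrel{\rm def}{=} \{ \nu_k \in \mathcal M_s^k : h_k(\nu_k) \in C \} \quad \text{and} \quad h^{-1}(C) \stackrel{\rm def}{=} \{ \nu \in \mathcal M_s : h(\nu) \in C \}.$$

Using inequalities (5.4)-(5.5) we obtain the following upper bound

$$e^{n(\epsilon_n + \mathrm{var}_k(\phi))}\, (n+1)^{|A|^k} (n-1)\, \exp\Big( n \sup_{\nu_k \in h_k^{-1}(C)} \big( \mathbb E_{\nu_k}[\phi_k] + h_k(\nu_k) \big) \Big). \tag{6.3}$$

If we consider sequences $(k(n))_{n \in \mathbb N}$ that satisfy the growth condition (3.1) we obtain

$$\limsup_{n \to \infty} \frac{1}{n} \log \rho\big\{ x_1^n : h_{k(n)}(x_1^n) \in C \big\} \le \limsup_{k \to \infty} \sup_{\nu_k \in h_k^{-1}(C)} \big( \mathbb E_{\nu_k}[\phi_k] + h_k(\nu_k) \big).$$

We will prove that for any $\epsilon > 0$ there exists an integer $K$ such that for any $k > K$
and for any $\nu_k \in h_k^{-1}(C)$ there exists a $\mu \in h^{-1}(C)$ such that

$$\mathbb E_{\nu_k}[\phi_k] + h_k(\nu_k) \le h(\mu) + \mathbb E_\mu[\phi] + \epsilon. \tag{6.4}$$

The arbitrariness of $\epsilon$ will imply the first statement of the theorem, since
$h(\mu) + \mathbb E_\mu[\phi] = -h(\mu|\rho)$ by (2.10).

To prove formula (6.4) we have only to take $\mu$ as the unique $(k-1)$-step Markov
extension of $\nu_k$ and $K$ such that $\mathrm{var}_K(\phi) < \epsilon$.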

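Let us now prove the lower bound. Consider an open set $O \subset \mathbb R$.

$$\sum_{\{x_1^n :\, h_k(x_1^n) \in O\}} \rho([x_1^n]) \ge e^{-n(\epsilon_n + \mathrm{var}_k(\phi))} \sum_{\{\pi_{k,n} \in \mathcal U^k(A^n) :\, h_k(\pi_{k,n}) \in O\}} |\{ x_1^n \in \mathcal T_{\pi_{k,n}} \}|\; e^{n\, \mathbb E_{\pi_{k,n}}[\phi_k]} \tag{6.5}$$

Using inequality (5.6) we obtain

$$e^{-n(\epsilon_n + \mathrm{var}_k(\phi))}\, (en)^{-2|A|^k} \sum_{\{\pi_{k,n} \in \mathcal U^k(A^n) :\, h_k(\pi_{k,n}) \in O\}} e^{n \{ h_k(\pi_{k,n}) + \mathbb E_{\pi_{k,n}}[\phi_k] \}}$$

$$\ge e^{-n(\epsilon_n + \mathrm{var}_k(\phi))}\, (en)^{-2|A|^k} \exp\Big( n \sup_{\nu_k \in h_k^{-1}(O) \cap\, \mathcal U^k(A^n)} \big( \mathbb E_{\nu_k}[\phi_k] + h_k(\nu_k) \big) \Big).$$

If we consider sequences $(k(n))_{n \in \mathbb N}$ which satisfy the growth condition (3.1) we
obtain

$$\liminf_{n \to \infty} \frac{1}{n} \log \rho\big\{ x_1^n : h_{k(n)}(x_1^n) \in O \big\} \ge \liminf_{n \to \infty} \sup_{\nu_{k(n)} \in h_{k(n)}^{-1}(O) \cap\, \mathcal U^{k(n)}(A^n)} \big( \mathbb E_{\nu_{k(n)}}[\phi_{k(n)}] + h_{k(n)}(\nu_{k(n)}) \big).$$

We will prove that for any $\epsilon > 0$ and for any $\mu \in h^{-1}(O)$ there exists, for $n$ large
enough, a $\pi_{k(n),n} \in h_{k(n)}^{-1}(O) \cap \mathcal U^{k(n)}(A^n)$ such that

$$\mathbb E_{\pi_{k(n),n}}[\phi_{k(n)}] + h_{k(n)}(\pi_{k(n),n}) \ge h(\mu) + \mathbb E_\mu[\phi] - \epsilon.$$

The arbitrariness of $\epsilon$ implies the second statement of Theorem 3.1.

When $n$ is large enough $|h_{k(n)}(\mu_{k(n)}) - h(\mu)|$ can become arbitrarily small and
from Lemmas 5.2 and 5.4, if $d_n = (k(n)+2)\, |A|^{k(n)} / n$, there exists a measure
$\pi_{k(n),n} \in \mathcal U^{k(n)}(A^n)$ such that

$$|h_{k(n)}(\mu_{k(n)}) - h_{k(n)}(\pi_{k(n),n})| \le -2 d_n \log \frac{d_n}{|A|^{k(n)}}.$$

For a sequence $(k(n))_{n \in \mathbb N}$ which satisfies the growth condition (3.1) both $d_n$ and
$-2 d_n \log \frac{d_n}{|A|^{k(n)}}$ converge to zero. Since $O$ is an open set we obtain that if $n$ is large
enough there exists a $\pi_{k(n),n} \in h_{k(n)}^{-1}(O) \cap \mathcal U^{k(n)}(A^n)$ such that $|h_{k(n)}(\pi_{k(n),n}) - h(\mu)|$
is arbitrarily small. It is also easy to show that

$$|\mathbb E_\mu[\phi] - \mathbb E_{\pi_{k(n),n}}[\phi_{k(n)}]| \le \mathrm{var}_{k(n)}(\phi) + d_n \|\phi\|_\infty.$$

The statement easily follows.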

The proof for the estimator $H_{k(n)}(x_1^n)/k(n)$ is analogous; we will only point out the
differences.

For the upper bound we need to prove that for any $\epsilon > 0$ there exists a $K$ such
that for any $k > K$ and for any $\nu_k \in \mathcal M_s^k$ with $H_k(\nu_k)/k \in C$, there exists $\mu \in \mathcal M_s$
with $h(\mu) \in C$ and such that inequality (6.4) holds. This can be done considering
$\mu = \frac{1}{k} \sum_{i=1}^{k} \nu^{(i)}$ where $\nu^{(i)} \in \mathcal M_s$ is the unique $(i-1)$-step Markov extension of $\nu_i$.
Due to the fact that $h$ is affine on $\mathcal M_s$, we have in fact that $h(\mu) = H_k(\nu_k)/k$.

The proof of the lower bound is similar. We omit the details.

The convexity of $I$ follows from the fact that the maps $h(\cdot), h(\cdot|\rho) : \mathcal M_s \to \mathbb R$ are
affine. Given $\nu \in \mathcal M_s$ such that $h(\nu) = x$ and $\mu \in \mathcal M_s$ such that $h(\mu) = y$, then for
any $c \in [0, 1]$

$$h(c\nu + (1-c)\mu) = cx + (1-c)y$$

$$h(c\nu + (1-c)\mu|\rho) = c\, h(\nu|\rho) + (1-c)\, h(\mu|\rho).$$

This implies that

$$I(cx + (1-c)y) \le h(c\nu + (1-c)\mu|\rho) = c\, h(\nu|\rho) + (1-c)\, h(\mu|\rho). \tag{6.6}$$

If we take the infimum over all $\nu \in \mathcal M_s$ such that $h(\nu) = x$ and $\mu \in \mathcal M_s$ such that
$h(\mu) = y$, from (6.6) one obtains the convexity of $I$.
Theorem 3.1 is proved.

6.2. Proof of Theorem 3.2. The proof of Theorem 3.2 is similar to that of
Theorem 3.1, so we leave the details to the reader.

6.3. Proof of Proposition 3.3. Let us recall the following large deviation
principle [Sj. Let $x_1^n$ be a sample path distributed according to a $g$-measure $\rho$. Then
the empirical process $\pi(\cdot; x_1^n)$ defined at (2.11) satisfies a large deviation principle
in $(\mathcal M_s, d_w)$ with normalizing factor $\frac{1}{n}$ and rate function

$$r(\nu) = h(\nu|\rho). \tag{6.7}$$

Here $d_w$ is a distance that metrizes weak convergence.

Now we observe that for every fixed $k$ the entropies under consideration are
continuous in $(\mathcal M_s, d_w)$. Therefore, the contraction principle [11] immediately yields
the proposition.

6.4. Proof of Proposition 3.4. We prove that the Legendre transform of $I$ is $R$.
We know from Theorem 3.1 that $I$ is a convex function and this implies the Legendre
duality.

We have

$$\sup_{u \in \mathbb R} \Big\{ tu - \inf_{\nu \in \mathcal M_s :\, h(\nu) = u} h(\nu|\rho) \Big\} = \sup_{\nu \in \mathcal M_s} \big\{ \mathbb E_\nu[\phi] + t\, h(\nu) + h(\nu) \big\}. \tag{6.8}$$

If $t > -1$, then we get by applying the variational principle

$$(6.8) = (t+1) \sup_{\nu \in \mathcal M_s} \Big\{ \mathbb E_\nu\Big[ \frac{\phi}{t+1} \Big] + h(\nu) \Big\} = (t+1)\, P_{top}\Big( \frac{\phi}{t+1} \Big).$$

If $t < -1$, we get

$$(6.8) = (t+1) \inf_{\nu \in \mathcal M_s} \Big\{ \mathbb E_\nu\Big[ \frac{\phi}{t+1} \Big] + h(\nu) \Big\}.$$

Observe that $h(\nu) \ge 0$ for all $\nu \in \mathcal M_s$. Moreover, the set of measures with entropy 0
is dense in $\mathcal M_s$ (with respect to the weak topology), see e.g. |^. Hence, for $t < -1$, (6.8) $=
(t+1) \inf\{ \mathbb E_\eta[\phi/(t+1)] : \eta \in \mathcal M_s \}$. The case $t = -1$ is trivial.

The identification of $R(t)$ with the scaled cumulant generating functions (formula
(3.5)) follows from general arguments [11].

It is interesting to notice that using the combinatorial properties of types and
the results of Section 5 it is possible to prove (3.5) directly. We just sketch the
proof.

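Following arguments already used in the proof of Theorem 3.1 we can obtain

$$\frac{1}{n} \log \sum_{x_1^n \in A^n} e^{n t\, h_{k(n)}(x_1^n)}\, \rho([x_1^n]) \le \sup_{\nu_{k(n)} \in \mathcal M_s^{k(n)}} \big\{ \mathbb E_{\nu_{k(n)}}[\phi_{k(n)}] + (t+1)\, h_{k(n)}(\nu_{k(n)}) \big\} + R_n \tag{6.9}$$

and

$$\frac{1}{n} \log \sum_{x_1^n \in A^n} e^{n t\, h_{k(n)}(x_1^n)}\, \rho([x_1^n]) \ge \sup_{\nu_{k(n)} \in \mathcal U^{k(n)}(A^n)} \big\{ \mathbb E_{\nu_{k(n)}}[\phi_{k(n)}] + (t+1)\, h_{k(n)}(\nu_{k(n)}) \big\} + R'_n \tag{6.10}$$

where $R_n$ and $R'_n$ are correcting terms converging to zero.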
We now compute the supremum in (6.9).

If $t \le -1$, the function to be maximized is convex and the supremum is attained
at one of the extremal points of $\mathcal M_s^{k(n)}$, which has entropy zero by virtue of Lemma
5.1. Hence the supremum in question equals

$$\sup\{ \mathbb E_{\nu_{k(n)}}[\phi_{k(n)}] : \nu_{k(n)} \in \mathcal E^{k(n)} \}. \tag{6.11}$$

If $t > -1$, the supremum in (6.9) is equal to

$$(t+1) \sup_{\nu \in \mathcal M_s} \Big\{ \mathbb E_\nu\Big[ \frac{\phi_{k(n)}}{t+1} \Big] + h(\nu) \Big\} = (t+1)\, P_{top}\Big( \frac{\phi_{k(n)}}{t+1} \Big).$$

To see this, we first notice that if $\nu$ is the $(k(n)-1)$-step Markov measure having
$\nu_{k(n)}$ as $k(n)$-marginals, then $h_{k(n)}(\nu_{k(n)}) = h(\nu)$. On the other hand, the variational
principle tells us that $\mathbb E_\nu[\phi_{k(n)}/(t+1)] + h(\nu)$ attains its supremum precisely at a
unique $(k(n)-1)$-step Markov measure because $\phi_{k(n)}$ is a $k(n)$-cylindrical function.
This supremum equals $P_{top}(\phi_{k(n)}/(t+1))$.

It is not difficult to prove now that the limit when $n \to \infty$ of the upper bound
coincides with $R(t)$. Using the results of Section 5 it is also possible to prove that
the lower bound has the same limit.

The result for the estimator $\hat H_{k(n)}(x_1^n)/k(n)$ can be deduced from the previous result
using the fact that $(\hat h_j(x_1^n))_j$ is a bounded decreasing sequence and

\[ \hat H_k(x_1^n) = \sum_{j=1}^{k} \hat h_j(x_1^n). \tag{6.12} \]

6.5. Proof of Proposition IsTsl The proof of this proposition is very simple and 
left to the reader. It is possible to get (6.7) directly using the combinatorics of
types.

6.6. Proof of Proposition |2ini When the variations of $\phi$ are square summable
the map $\beta \mapsto P_{top}(\beta\phi)$, $\beta \in \mathbb{R}$, is continuously differentiable and strictly convex.
This can be deduced from |2S]; the extension of their proofs to the square summable
case is straightforward. This implies that the map $R$ is continuously differentiable
and strictly convex in the interval $(-1,\infty)$. Moreover $R(0) = 0$ and $R'(0) = h(\rho)$.
This establishes the first part of the proposition. 

We now turn to the proof of the representation formula (3.9). First introduce the
following auxiliary function of $\beta \in [0,+\infty)$:

\[ I(\beta) \stackrel{\mathrm{def}}{=} \inf\{h(\nu|\rho) : \nu \in \mathcal{M}_s,\ h(\nu) = h(\rho_{\beta\phi})\}. \]

We now claim that $I(\beta) = h(\rho_{\beta\phi}|\rho)$. The proof is by contradiction with the variational
principle. Assume that $\eta \neq \rho_{\beta\phi}$ is such that

\[ h(\eta|\rho) < h(\rho_{\beta\phi}|\rho) \quad\text{and}\quad h(\eta) = h(\rho_{\beta\phi}). \]

This means that (remember (2.10))

\[ \mathbb{E}_\eta[\phi] > \mathbb{E}_{\rho_{\beta\phi}}[\phi]. \]

Multiplying this inequality by $\beta > 0$ and adding $h(\eta)$ to the lhs and $h(\rho_{\beta\phi})$ to the
rhs (since these two quantities are indeed equal by hypothesis) yields

\[ \mathbb{E}_\eta[\beta\phi] + h(\eta) > \mathbb{E}_{\rho_{\beta\phi}}[\beta\phi] + h(\rho_{\beta\phi}). \]

But the variational principle tells us that the rhs is equal to the supremum over all
shift-invariant measures $\nu$ of $\mathbb{E}_\nu[\beta\phi] + h(\nu)$ and is attained only for $\nu = \rho_{\beta\phi}$.
Therefore $\eta$ must be equal to $\rho_{\beta\phi}$. In this instance of the variational principle, we used
the fact that if a potential $\phi$ has square summable variations, then $\beta\phi$ also has
square summable variations, in particular for any $\beta > 0$. (^)

We now invoke Lemma 6.1 hereafter to define a map $\mathcal{H} : [0,+\infty[ \to ]h_\infty, \log|A|]$
defined as $\mathcal{H}(\beta) = h(\rho_{\beta\phi})$. Since this map is continuous and strictly decreasing, to each
$u \in ]h_\infty, \log|A|]$ we can associate a unique $\beta_u$ such that $h(\rho_{\beta_u\phi}) = u$.

The last statement of the proposition follows from the first comment in Section 

H n 

We state and prove the lemma used just above. 

Lemma 6.1. Assume that $\phi$ has square summable variations (hence so has $\beta\phi$ for
all $\beta \in \mathbb{R}$) and is not cohomologous to a constant (^). Then the map $\beta \mapsto h(\rho_{\beta\phi})$
is continuous, strictly decreasing on $[0,+\infty[$ and $h(\rho_{\beta\phi}) \in ]h_\infty, \log|A|]$.

Proof. By the variational principle, $h(\rho_{\beta\phi}) = P_{top}(\beta\phi) - \beta\,\mathbb{E}_{\rho_{\beta\phi}}[\phi]$. (This shows
continuity.) The map $\beta \mapsto P_{top}(\beta\phi)$ is strictly decreasing (since $\phi < 0$) and strictly convex
(see above). This strict convexity of the pressure can be translated as follows:

Therefore we get that $\beta \mapsto h(\rho_{\beta\phi})$ is strictly decreasing when $\beta > 0$. It is obvious
from the variational principle that $h(\rho_{\beta\phi}) = \log|A|$ when $\beta = 0$. Since $h(\rho_{\beta\phi})$ is
bounded from below by 0, $h_\infty = \lim_{\beta\to+\infty} h(\rho_{\beta\phi})$ exists. This ends the proof of
the lemma. □
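
The monotonicity step of this proof can be spelled out as follows. This is only a sketch under two assumptions that are not stated in this form above: we abbreviate $P(\beta) := P_{top}(\beta\phi)$, and we use the standard identity $P'(\beta) = \mathbb{E}_{\rho_{\beta\phi}}[\phi]$ for the derivative of the pressure, so that $h(\rho_{\beta\phi}) = P(\beta) - \beta P'(\beta)$. For $0 \le \beta_1 < \beta_2$:

```latex
\begin{align*}
h(\rho_{\beta_2\phi}) - h(\rho_{\beta_1\phi})
 &= P(\beta_2) - P(\beta_1) - \beta_2 P'(\beta_2) + \beta_1 P'(\beta_1) \\
 &< (\beta_2 - \beta_1)\, P'(\beta_2) - \beta_2 P'(\beta_2) + \beta_1 P'(\beta_1)
    && \text{(strict convexity of $P$)} \\
 &= \beta_1 \bigl( P'(\beta_1) - P'(\beta_2) \bigr) \;\le\; 0
    && \text{($\beta_1 \ge 0$ and $P'$ is nondecreasing).}
\end{align*}
```

The strict inequality uses the fact that a differentiable strictly convex function lies strictly above its tangent at $\beta_2$, evaluated at $\beta_1 \neq \beta_2$.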

7. Proof of some lemmas 

This section contains the proofs of the lemmas of Section 5.

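Let us introduce the following graph-theoretical representations that we will use
in the proofs of the lemmas. We call $\mathcal{N}_n^k$ the set of integer-valued maps $N_n^k : A^k \to \mathbb{N}$
such that

\[ \sum_{a_1^k \in A^k} N_n^k(a_1^k) = n \tag{7.1} \]

and

\[ \sum_{b\in A} N_n^k(a_1^{k-1}b) = \sum_{b\in A} N_n^k(b\,a_1^{k-1}) \qquad \forall\, a_1^{k-1} \in A^{k-1}. \tag{7.2} \]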

Let $\mathcal{L}_n^k$ be the subset of $\mathcal{M}^k$ whose elements are obtained by normalizing elements
in $\mathcal{N}_n^k$, i.e.

\[ \mathcal{L}_n^k = \Big\{\nu_k \in \mathcal{M}^k : \exists\, N_n^k \in \mathcal{N}_n^k \ \text{s.t.}\ \nu_k(\cdot) = \frac{N_n^k(\cdot)}{n}\Big\}. \tag{7.3} \]

If $k = 1$ then $\mathcal{U}^k(A^n) = \mathcal{L}_n^k$, otherwise a strict inclusion holds: $\mathcal{U}^k(A^n) \subset \mathcal{L}_n^k$ ($n > 1$).
We will call a $k$-order compatible balanced directed multigraph ($k$-multigraph,
$k$-M, for short) a directed multigraph with the following properties: the vertices
are labeled with elements of $A^{k-1}$; for each vertex the number of outgoing arrows
is equal to the number of ingoing arrows; an arrow can go from the vertex $a_1^{k-1}$ to
the vertex $b_1^{k-1}$ if and only if $a_2^{k-1} = b_1^{k-2}$. This arrow inherits the natural label
$a_1^{k-1}b_{k-1}$ (note that several arrows can have the same label).



In case of non-uniqueness, the claim still holds but $\rho_{\beta\phi}$ is any equilibrium state associated to
$\beta\phi$ since relative entropy only depends on $\beta\phi$.

I.e. is not the equilibrium measure for a potential of the form $V - V\circ T + c$, where $V$ is a
measurable function, $c \in \mathbb{R}$. In this case the equilibrium measure would coincide with the measure
of maximal entropy, the uniform Bernoulli measure.
Given an element $N_n^k \in \mathcal{N}_n^k$ we represent it with a $k$-M containing $n$ arrows (JJ,
section II.2) by drawing $N_n^k(b_1^k)$ directed edges from the vertex associated to $b_1^{k-1}$ to
the one associated to $b_2^k$.

Conversely, given a $k$-M containing $n$ arrows, it is possible to associate to
it an element of $\mathcal{N}_n^k$ by defining $N_n^k(a_1^k)$ as the number of arrows going from $a_1^{k-1}$ to
$a_2^k$. This gives a bijective correspondence.
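
The correspondence above can be illustrated numerically. The following is a minimal sketch, not part of the original argument: it assumes that the counts are read along the cyclicized sample (indices taken mod n), and the helper names `cyclic_block_counts` and `is_balanced` as well as the binary sample are hypothetical choices made for the illustration. It checks the normalization (7.1) and the balance condition (7.2).

```python
from collections import Counter
from itertools import product


def cyclic_block_counts(sample, k):
    """Count the k-blocks read along the cyclicized sample (indices mod n)."""
    n = len(sample)
    return Counter(tuple(sample[(i + j) % n] for j in range(k)) for i in range(n))


def is_balanced(counts, alphabet, k):
    """Condition (7.2): for every (k-1)-block a, sum_b N(ab) == sum_b N(ba)."""
    for a in product(alphabet, repeat=k - 1):
        out_degree = sum(counts.get(a + (b,), 0) for b in alphabet)
        in_degree = sum(counts.get((b,) + a, 0) for b in alphabet)
        if out_degree != in_degree:
            return False
    return True


# Hypothetical binary sample of length n = 10, block length k = 3.
sample = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
k = 3
counts = cyclic_block_counts(sample, k)
total = sum(counts.values())              # condition (7.1): should equal n
balanced = is_balanced(counts, [0, 1], k)  # condition (7.2)
```

Each key of `counts` is the label of an arrow of the associated k-M and its value is the multiplicity of that arrow; a map that puts a single unit on one block such as `(0, 0, 1)` fails the balance test.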

To each element $\nu_k \in \mathcal{U}^k(A^n)$, we associate the element $N_n^k = n\nu_k \in \mathcal{N}_n^k$. Then
we construct a $k$-M as before, which is connected (note that we are not considering
vertices without ingoing/outgoing arrows). Given two vertices $a_1^{k-1}$ and $b_1^{k-1}$ which
have some ingoing/outgoing arrows, there exist $i < j$ with $|i - j| < n$ such that
$x_i^{i+k-2} = a_1^{k-1}$ and $x_j^{j+k-2} = b_1^{k-1}$. This means that for any $i \le l < j$ there exists
at least one arrow with label $x_l^{l+k-1}$, i.e., at least one path from the vertex $a_1^{k-1}$ to
the vertex $b_1^{k-1}$.

Conversely, given a connected $k$-M we associate to it an element of $\mathcal{U}^k(A^n)$. A
connected $k$-M has at least one Eulerian circuit (see for example W section 1.3).
One follows the circuit generating a sample path in the following way: every time
one goes through an arrow with label $a_1^k$, one concatenates the element $a_k$. The
sample path $x_1^n$ that one obtains in this way is such that $n\pi_k(\cdot\,; x_1^n)$ has associated
the connected $k$-M one started with.

This is a bijective correspondence between $\mathcal{U}^k(A^n)$ and the subclass of connected
$k$-M containing $n$ arrows. This correspondence says that it is possible, starting from
the $k$-M associated to an element $\pi_{k,n} \in \mathcal{U}^k(A^n)$, to construct an element $x_1^n \in T_{\pi_{k,n}}$
by simply following an Eulerian circuit.
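
The reconstruction of a sample from a connected k-M can be sketched in code. The block below is an illustration only: it uses Hierholzer's algorithm as one possible way to find an Eulerian circuit (the text does not prescribe a method), and the function names and the binary sample are hypothetical. It rebuilds a sample from the cyclic block counts and the sample so obtained has the same cyclic k-block counts as the original one.

```python
from collections import Counter, defaultdict


def cyclic_block_counts(sample, k):
    """Count the k-blocks read along the cyclicized sample (indices mod n)."""
    n = len(sample)
    return Counter(tuple(sample[(i + j) % n] for j in range(k)) for i in range(n))


def eulerian_circuit(counts):
    """Hierholzer's algorithm on the k-M whose arrows are the counted k-blocks.

    An arrow labeled a_1^k goes from vertex a_1^{k-1} to vertex a_2^k.
    Returns the arrow labels in the order in which the circuit traverses them.
    """
    outgoing = defaultdict(list)
    for block, multiplicity in counts.items():
        outgoing[block[:-1]].extend([block] * multiplicity)
    start = next(iter(outgoing))
    stack = [(start, None)]        # (current vertex, arrow used to reach it)
    circuit = []
    while stack:
        vertex, arrow_in = stack[-1]
        if outgoing[vertex]:
            arrow = outgoing[vertex].pop()
            stack.append((arrow[1:], arrow))
        else:
            stack.pop()
            if arrow_in is not None:
                circuit.append(arrow_in)
    circuit.reverse()
    return circuit


# Hypothetical binary sample; its k-M is connected by construction.
sample = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
k = 3
counts = cyclic_block_counts(sample, k)
circuit = eulerian_circuit(counts)
# Concatenate the last symbol a_k of every traversed arrow a_1^k.
rebuilt = [arrow[-1] for arrow in circuit]
same_type = cyclic_block_counts(rebuilt, k) == counts
```

The rebuilt sample is in general different from the original one (it depends on the circuit that is found), but it belongs to the same type class, which is the point of the correspondence.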

Some classical combinatorial arguments allow one to estimate the number of Eulerian
circuits of a $k$-M and this gives an estimate on the number of samples $x_1^n \in T_{\pi_{k,n}}$
(see JJj section II.2):



<IWGT^..„}|<n- 



(7.4) 
We will call a $k$-order weighted compatible balanced directed graph ($k$-weighted
graph, $k$-WG, for short) a directed graph with the following properties: the vertices
are labeled with elements of $A^{k-1}$; to each arrow is associated a nonnegative weight;
for each vertex, the sum of the weights associated to outgoing arrows is equal to
the sum of the weights associated to ingoing arrows; an arrow can go from the
vertex $a_1^{k-1}$ to the vertex $b_1^{k-1}$ if and only if $a_2^{k-1} = b_1^{k-2}$; the total sum of the
weights is 1.

Given a measure $\nu_k \in \mathcal{M}^k$ we can represent it by a $k$-WG and conversely given
a $k$-WG we can associate to it an element of $\mathcal{M}^k$.

7.1. Proof of lemma 5.1. A convex combination of measures corresponds to a
$k$-WG with a convex combination of weights. Therefore the extremality property in
$\mathcal{M}^k$ corresponds to the extremality property in the set of $k$-WG's. Consider a $k$-WG
having nonzero weights only on arrows forming a single cycle (a loop of successive
arrows visiting a vertex no more than once). All the nonzero weights are equal to
$1/\ell$, where $\ell$ is the length of the cycle. No such $k$-WG can be obtained as a
convex combination of other $k$-WG's, since otherwise at least one of them would violate
one of the conditions to be a $k$-WG. Moreover any $k$-WG can be obtained as a
convex combination of a finite number of $k$-WG's consisting of a single cycle. A
decomposition can be obtained by iterating a finite number of times the following
procedure. Take the (an) arrow to which is associated the minimum weight and
consider a cycle containing it. Subtract the minimum weight from all the arrows
belonging to the cycle and add the $k$-WG consisting of the single cycle, weighted by
$\ell \cdot \mathrm{m.w.}$, where m.w. = minimum weight, to the convex decomposition. This gives a
complete characterization of $\mathcal{E}^k$. A direct consequence is that $h_k(\nu_k) = 0$ for every
$\nu_k \in \mathcal{E}^k$. This is because for every measure $\nu_k$ whose associated $k$-WG consists
of a single cycle, $\nu_k(a_k | a_1^{k-1})$ can be only zero or one. The lemma is proved.
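
The vanishing of the conditional entropy on single cycles can be checked on a small example. The sketch below is an illustration under explicit assumptions: it takes $h_k = H_k - H_{k-1}$ (block entropy of order k minus block entropy of order k-1, with natural logarithms), computes both from cyclic empirical measures, and uses two hypothetical binary words. The cyclic word 0011 with k = 3 corresponds to a single cycle through the vertices 00, 01, 11, 10; the word 001101 does not.

```python
import math
from collections import Counter


def cyclic_empirical(sample, k):
    """k-block empirical measure of the cyclicized sample, as a dict block -> mass."""
    n = len(sample)
    counts = Counter(tuple(sample[(i + j) % n] for j in range(k)) for i in range(n))
    return {block: m / n for block, m in counts.items()}


def block_entropy(measure):
    """Shannon entropy (natural logarithm) of a finitely supported measure."""
    return -sum(p * math.log(p) for p in measure.values() if p > 0)


def conditional_entropy(sample, k):
    """h_k = H_k - H_{k-1} for the cyclic empirical measure (with H_0 = 0)."""
    h_k = block_entropy(cyclic_empirical(sample, k))
    h_km1 = block_entropy(cyclic_empirical(sample, k - 1)) if k > 1 else 0.0
    return h_k - h_km1


# Single cycle: every 2-block of the cyclic word is followed by a unique symbol.
h_cycle = conditional_entropy([0, 0, 1, 1], 3)
# Not a single cycle: the vertices 01 and 10 are each visited twice.
h_mixed = conditional_entropy([0, 0, 1, 1, 0, 1], 3)
```

For the second word the value is strictly positive, since the 2-blocks 01 and 10 are each followed by two different symbols.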

7.2. Proof of lemma 5.2. Given a measure $\nu_k \in \mathcal{M}^k$ it is possible to construct
a measure $\tilde\mu_k \in \mathcal{L}_n^k$ such that $\|\tilde\mu_k - \nu_k\|_{TV} \le 2|A|^k/n$. This is trivial when $k = 1$
and a little bit more tricky when $k > 1$ because of the stationarity condition (2.1).
Consider for any arrow from $a_1^{k-1}$ to $a_2^k$ the following parameter



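\[
\gamma(a_1^k) = \min\Big\{ \nu_k(a_1^k) - \frac{[n\nu_k(a_1^k)]}{n},\ \frac{[n\nu_k(a_1^k)] + 1}{n} - \nu_k(a_1^k) \Big\}
\tag{7.5}
\]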



where $[\cdot]$ represents the integer part. Take the (an) arrow with the
minimum value of $\gamma$. Consider an elementary cycle containing $a_1^k$ and add or subtract
(depending on whether the minimum value in (7.5) was obtained with the first or the second
argument) the value $\gamma(a_1^k)$ to all the elements of the cycle. Fix the values of all the
weights whose value is $i/n$, with $i$ some integer $0 \le i \le n$, and remove them
from the $k$-WG. It is easy to see that one can iterate this procedure until all the
values of the weights are fixed. One ends up with some weights which satisfy the stationarity
condition but are not necessarily normalized to one. One concludes the procedure
by adding or subtracting the weight necessary to have the wanted normalization.
One can do this sequentially by using an elementary unit of weight $1/n$ and adding
or subtracting one unit of weight at a time in elementary cycles, so that the
stationarity condition is preserved. This is always possible. The measure $\tilde\mu_k$ that
is obtained in this way belongs to $\mathcal{L}_n^k$ and is such that

\[ \|\tilde\mu_k - \nu_k\|_{TV} \le \frac{2|A|^k}{n}. \]

If the $k$-M corresponding to $\tilde\mu_k$ is connected then the proof is finished. If the
$k$-M associated to $\tilde\mu_k$ is not connected let $m > 1$ be the number of connected
components containing respectively $e(1), \ldots, e(m)$ directed edges with $\sum_{j=1}^m e(j) =
n$. Considering an Eulerian circuit for every component one can associate a sample
path $s(i)$ of length $e(i)$ to the $i$th component for $i = 1, \ldots, m$. The measure $\tilde\mu_k$ has
the following expression

\[ \tilde\mu_k(\cdot) = \sum_{j=1}^{m} \frac{e(j)}{n}\,\pi_k(\cdot\,; s(j)). \tag{7.6} \]

Let us now consider the sample path $s = s(1)s(2)\cdots s(m)$ of length $n$ and construct
$\mu_k$ as the $k$-empirical measure $\mu_k(\cdot) = \pi_k(\cdot\,; s) \in \mathcal{U}^k(A^n)$. Both $\mu_k$ and $\tilde\mu_k$ are
constructed by sliding windows of width $k$ along cyclicized samples, and computing
frequencies in these windows. Every time the window of size $k$ is overlapped to the
sample $s$ and does not cross points of separation among different $s(i)$, the $k$-sequence
that is matched contributes both to $\tilde\mu_k$ and to $\mu_k$. Using the fact that $m \le |A|^{k-1}$ we
deduce

ll/ifc-Afcllt. <^ (7.7) 

n 

Using (7.6), (7.7) and the triangle inequality yields the statement of the lemma.

7.3. Proof of lemma 5.3. The proof of inequalities (5.4) and (5.5) is very simple
and elegant and can be found in [23], more precisely in section I.6.d in the case
of non-cyclicized samples and in section II.1.a in the case of cyclicized samples,
which is our case. The proof of inequality (5.6) is obtained from estimate (7.4) and
Stirling's formula. Inequality (5.5) can be proved in an analogous way.
7.4. Proof of lemma 5.4. This lemma can be found in TU" but we give its proof
hereafter for the sake of completeness. Consider $\mu_k$ and $\nu_k$ two measures in $\mathcal{M}^k$
such that $\|\nu_k - \mu_k\|_{TV} \le \delta$. Let us set $\delta_k(a_1^k) = |\nu_k(a_1^k) - \mu_k(a_1^k)|$. Obviously

Using the triangle inequality one obtains

\[
|h_k(\nu_k) - h_k(\mu_k)| \le \sum_{a_1^k} \big|\nu_k(a_1^k)\log\nu_k(a_1^k) - \mu_k(a_1^k)\log\mu_k(a_1^k)\big|
+ \sum_{a_1^{k-1}} \big|\nu_{k-1}(a_1^{k-1})\log\nu_{k-1}(a_1^{k-1}) - \mu_{k-1}(a_1^{k-1})\log\mu_{k-1}(a_1^{k-1})\big| .
\]

By a simple computation it is possible to obtain the modulus of continuity of the
function $-x\log x$ on the interval $[0,1]$ when $\delta$ is small enough:

\[ \sup_{\{x,y\in[0,1] :\, |x-y|\le\delta\}} |x\log x - y\log y| = -\delta\log\delta . \]

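Using this result we get

\[
|h_k(\nu_k) - h_k(\mu_k)| \le -\sum_{a_1^k}\delta_k(a_1^k)\log\delta_k(a_1^k) - \sum_{a_1^{k-1}}\delta_{k-1}(a_1^{k-1})\log\delta_{k-1}(a_1^{k-1}) .
\tag{7.8}
\]

We write the right hand side of (7.8) as

\[
-|A|^k \sum_{a_1^k} \frac{1}{|A|^k}\,\delta_k(a_1^k)\log\delta_k(a_1^k)
\;-\; |A|^{k-1} \sum_{a_1^{k-1}} \frac{1}{|A|^{k-1}}\,\delta_{k-1}(a_1^{k-1})\log\delta_{k-1}(a_1^{k-1})
\tag{7.9}
\]

and apply Jensen's inequality using the fact that $-x\log x$ is a concave function.
When $\delta$ is small enough we finally obtain

\[ |h_k(\nu_k) - h_k(\mu_k)| \le -\delta\log\frac{\delta}{|A|^k} - \delta\log\frac{\delta}{|A|^{k-1}} . \]

The lemma is proved.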
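
The modulus of continuity of $-x\log x$ used in this proof can be checked numerically. The following is a rough sketch, not a proof: it assumes natural logarithms, restricts the supremum to a uniform grid on $[0,1]$ (the grid size and the value $\delta = 0.05$ are arbitrary choices for the illustration), and compares the result with $-\delta\log\delta$.

```python
import math


def eta(x):
    """The function -x log x on [0, 1], extended by continuity at x = 0."""
    return 0.0 if x == 0 else -x * math.log(x)


def sup_increment(grid_size, steps):
    """Largest |eta(x) - eta(y)| over grid points x < y with y - x <= steps / grid_size."""
    best = 0.0
    for i in range(grid_size + 1):
        for j in range(i + 1, min(i + steps, grid_size) + 1):
            best = max(best, abs(eta(i / grid_size) - eta(j / grid_size)))
    return best


grid_size, steps = 1000, 50
delta = steps / grid_size                  # delta = 0.05
numerical_sup = sup_increment(grid_size, steps)
closed_form = -delta * math.log(delta)
```

On this grid the supremum is attained at the pair $x = 0$, $y = \delta$, which is why the two values agree; for larger $\delta$ the identity is not expected to hold, in line with the restriction to small $\delta$ in the text.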